Conversation

@bghira bghira commented Oct 29, 2025

this ports the CUDA NF4 support to Metal.

so far, I've targeted nf4 quant/dequant because it's one of the least-accessible formats for Mac users.

we're using uint8 under the hood. for what it's worth, Metal (and the underlying hardware) lacks fp8/fp4 support.
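The uint8 trick above can be sketched in plain Python: NF4 packs two 4-bit codes into each byte, and dequantization maps each code through a fixed 16-entry lookup table scaled by a per-block absmax. The table values below are the NF4 quantiles from the QLoRA paper; the nibble ordering (high nibble first) is an assumption for illustration and may differ from the actual bitsandbytes layout.

```python
# NF4 codebook: 16 quantiles of a normal distribution, normalized to [-1, 1].
NF4_TABLE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def pack_nf4(codes):
    """Pack pairs of 4-bit codes (0..15) into uint8 bytes, high nibble first."""
    assert len(codes) % 2 == 0
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

def dequantize_nf4(packed, absmax):
    """Unpack each byte into two codes and map them through the NF4 table."""
    out = []
    for byte in packed:
        out.append(NF4_TABLE[byte >> 4] * absmax)
        out.append(NF4_TABLE[byte & 0x0F] * absmax)
    return out
```

This is the per-element logic a Metal kernel would perform in parallel; since the codes are plain uint8 bytes, no fp8/fp4 hardware support is needed.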

performance has not been at the forefront of this effort; most of the time was spent working out how to plug the metallib into bitsandbytes and build it correctly.

I'd like some feedback on this approach: given my inexperience with your build toolchain, it's quite likely I've done things in ways that can be improved.

I'm building on lessons I'd learnt while building a pytorch custom op for universal-metal-flash-attention, namely the way the MTLBuffers are retrieved from torch MPSGraph objects, which required the use of the torch headers.


bghira commented Oct 29, 2025

@rickardp cc

@matthewdouglas matthewdouglas self-assigned this Oct 29, 2025
@matthewdouglas matthewdouglas self-requested a review October 29, 2025 17:44

bghira commented Nov 11, 2025

@matthewdouglas if we can get this reviewed and merged, i can continue adding Metal support.

matthewdouglas (Member) commented

Hi @bghira

For the time being, I'd like to avoid linking in libtorch. Doing so adds more complexity to our packaging process, and IMO it is not worth it for only this. We aim to support a reasonably broad range of PyTorch versions; if we linked to libtorch, we'd have to either build for each of those versions and somehow distribute them all, or pin the version of PyTorch.

Eventually we'll consider using the newer LibTorch Stable ABI, but that would be quite a while away, and we're only going to do it if there's very clear benefit to doing so.

Instead, I ask that native code be built independently of PyTorch and ideally expose a C API that can be used in the same way the CPU, CUDA, ROCm, and XPU backends work. Ideally we would not build a Python extension that links to CPython either, but if we do, I would also ask that it use one build for all Python 3.10+ versions, e.g. with the Stable ABI.
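The pattern described above can be sketched with ctypes: the native library is built with no PyTorch dependency and exposes plain C symbols, which the Python side binds at runtime. The library name and the entry point signature here are hypothetical, not the actual bitsandbytes ABI.

```python
import ctypes

def load_metal_backend(path="libbitsandbytes_mps.dylib"):
    """Bind a hypothetical NF4 dequantize symbol from a PyTorch-free dylib.

    Nothing here links against libtorch or CPython internals, so one built
    artifact works across PyTorch and Python versions.
    """
    lib = ctypes.CDLL(path)
    # Hypothetical entry point: dequantize packed NF4 codes into fp32.
    lib.cdequantize_nf4_fp32.argtypes = [
        ctypes.c_void_p,  # packed uint8 codes
        ctypes.c_void_p,  # per-block absmax values
        ctypes.c_void_p,  # fp32 output buffer
        ctypes.c_int,     # number of output elements
    ]
    lib.cdequantize_nf4_fp32.restype = None
    return lib
```

Because the binding happens at call time via ctypes rather than at link time, the same wheel can ship one native artifact per platform instead of one per (PyTorch, Python) pair.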

I'm not super familiar here, but it seems the main things we need out of Torch are:

  • torch::mps::get_dispatch_queue()
  • torch::mps::get_command_buffer()
  • torch::mps::commit()

It would be great to find a way around that. As mentioned on Discord, torch.mps.compile_shader() may also be an option. Do note that you can still write the shader in a separate file and read it in.
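The compile_shader suggestion could look roughly like this: keep the Metal kernel source in its own file (inlined below as a string for self-containment) and compile it at runtime, which avoids any link-time libtorch dependency. The kernel body mirrors the uint8/NF4 lookup described earlier; the exact dispatch API of the object returned by torch.mps.compile_shader may differ between PyTorch versions, so treat this as a sketch.

```python
# Metal source for a hypothetical NF4 dequantize kernel. In practice this
# could live in a separate .metal file and be read in with open().read().
DEQUANT_NF4_SRC = """
#include <metal_stdlib>
using namespace metal;

constant float nf4_table[16] = {
    -1.0f, -0.6961928f, -0.52507305f, -0.39491749f,
    -0.28444138f, -0.18477343f, -0.09105004f, 0.0f,
    0.0795803f, 0.1609302f, 0.2461123f, 0.33791524f,
    0.44070983f, 0.562617f, 0.72295684f, 1.0f
};

kernel void dequantize_nf4(device const uchar *packed [[buffer(0)]],
                           device const float *absmax [[buffer(1)]],
                           device float *out [[buffer(2)]],
                           uint gid [[thread_position_in_grid]]) {
    uchar b = packed[gid];
    out[2 * gid]     = nf4_table[b >> 4] * absmax[0];
    out[2 * gid + 1] = nf4_table[b & 0x0F] * absmax[0];
}
"""

def compile_kernel():
    """Compile the shader on an MPS-enabled PyTorch build (imported lazily)."""
    import torch  # requires macOS with an MPS device
    lib = torch.mps.compile_shader(DEQUANT_NF4_SRC)
    # Compiled kernels are exposed as attributes of the returned library.
    return lib.dequantize_nf4
```

This keeps the build step to plain Python packaging: no metallib compilation at install time and no C++ extension at all.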

I understand that this is a little different from most PyTorch extensions with native code, but it's intentional, to help us with distribution and broad compatibility.

@bghira bghira closed this Nov 12, 2025

bghira commented Nov 12, 2025

understood; under those constraints, Metal support basically can't be taken any further in this library without substantial performance drawbacks.

